Non-GPU-resident symmetric indefinite factorization

Authors

  • Ichitaro Yamazaki
  • Stanimire Tomov
  • Jack J. Dongarra
Abstract

Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, USA

Correspondence: Ichitaro Yamazaki, Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, USA. Email: [email protected]

Funding information: National Science Foundation, NVIDIA, Matrix Algebra for GPU and Multicore Architectures (MAGMA) for Large Petascale Systems, Grant/Award Number: ACI-1339822

Summary

We study various algorithms for factorizing a symmetric indefinite matrix that does not fit in the core memory of a computer. There are two sources of data movement into the memory: one needed for selecting and applying pivots, and the other needed to update each column of the matrix for the factorization. It is a challenge to obtain high performance from such an algorithm when pivoting is required to ensure the numerical stability of the factorization. For example, when factorizing each column of the matrix, a diagonal entry that ensures stability may need to be selected as a pivot among the remaining diagonals and moved to the leading diagonal by swapping both the corresponding rows and columns of the matrix. If the pivot is not in the core memory, then it must be loaded into the core memory.

For updating the matrix, the data locality may be improved by partitioning the matrix. For example, a right-looking partitioned algorithm first factorizes the leading columns, called the panel, and then uses the factorized panel to update the trailing submatrix. This algorithm only accesses the trailing submatrix after each panel factorization (instead of after each column factorization) and performs most of its floating-point operations (flops) using BLAS-3, which can take advantage of the memory hierarchy. However, because the pivots cannot be predetermined, the whole trailing submatrix must be updated before the next panel factorization can start. When the whole submatrix does not fit in the core memory all at once, loading the block columns into the memory can become the performance bottleneck. Similarly, the left-looking variant of the algorithm would need to update each panel with all of the previously factorized columns. This makes it a much greater challenge to implement an efficient out-of-core symmetric indefinite factorization than an out-of-core nonsymmetric LU factorization with partial pivoting, which only requires swapping the rows of the matrix and accesses the trailing submatrix after each in-core factorization (instead of after each panel factorization, as in the symmetric factorization).

To reduce the amount of data transfer, in this paper we use the recently proposed left-looking communication-avoiding variant of the symmetric factorization algorithm to factorize the columns in the core memory, and then perform the partitioned right-looking out-of-core trailing submatrix updates. This combination may still require loading the pivots into the core memory, but it only updates the trailing submatrix after each in-core factorization, while the previous algorithm updates it after each panel factorization. Although these in-core and out-of-core algorithms can be applied at any level of the memory hierarchy, we apply our designs to the GPU and CPU memory, respectively. We call this specific implementation of the algorithm a non-GPU-resident implementation. Our performance results on the current hybrid CPU/GPU architecture demonstrate that when the matrix is much larger than the GPU memory, the proposed algorithm can obtain significant speedups over the communication-hiding implementations of the previous algorithms.
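The pivoting step described in the summary (pick a diagonal entry, move it to the leading position by swapping rows and columns, then update the trailing submatrix) can be illustrated with a small sketch. This is not the paper's algorithm: it is an unblocked, in-memory, right-looking LDL^T factorization that assumes 1x1 diagonal pivots are always sufficient, which is not true for general symmetric indefinite matrices. The function names `ldlt_diagonal_pivoting` and `max_residual` are hypothetical and chosen only for this illustration.

```python
def ldlt_diagonal_pivoting(a):
    """Factor a symmetric matrix as P A P^T = L D L^T using 1x1 diagonal pivots.

    Illustrative assumption: a nonzero diagonal pivot exists at every step.
    Returns (perm, L, d), where row/column i of the permuted matrix is
    row/column perm[i] of the input.
    """
    n = len(a)
    a = [row[:] for row in a]            # work on a copy of the input
    perm = list(range(n))
    L = [[0.0] * n for _ in range(n)]
    d = [0.0] * n
    for k in range(n):
        # Select the largest-magnitude entry among the remaining diagonals.
        p = max(range(k, n), key=lambda i: abs(a[i][i]))
        if a[p][p] == 0.0:
            raise ValueError("no nonzero diagonal pivot; a 2x2 pivot would be needed")
        if p != k:
            # Symmetric swap: exchange rows k and p, then columns k and p.
            a[k], a[p] = a[p], a[k]
            for row in a:
                row[k], row[p] = row[p], row[k]
            L[k], L[p] = L[p], L[k]      # keep already-computed rows of L consistent
            perm[k], perm[p] = perm[p], perm[k]
        d[k] = a[k][k]
        L[k][k] = 1.0
        for i in range(k + 1, n):
            L[i][k] = a[i][k] / d[k]
        # Right-looking step: the whole trailing submatrix is touched
        # immediately after column k is factorized.
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                a[i][j] -= L[i][k] * d[k] * L[j][k]
    return perm, L, d


def max_residual(a, perm, L, d):
    """Largest entry of |L D L^T - P A P^T|, used to check the factorization."""
    n = len(a)
    worst = 0.0
    for i in range(n):
        for j in range(n):
            s = sum(L[i][k] * d[k] * L[j][k] for k in range(n))
            worst = max(worst, abs(s - a[perm[i]][perm[j]]))
    return worst


A = [[1.0, 2.0, 3.0],
     [2.0, -8.0, 1.0],
     [3.0, 1.0, 4.0]]
perm, L, d = ldlt_diagonal_pivoting(A)
print(perm, d, max_residual(A, perm, L, d))
```

In this sketch the trailing update runs after every column, which is the access pattern the summary identifies as expensive when the trailing submatrix does not fit in the core memory. The partitioned and communication-avoiding variants discussed in the summary reduce how often that update is performed.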

Similar resources

Multifrontal Computations on GPUs and Their Multi-core Hosts

The use of GPUs to accelerate the factoring of large sparse symmetric indefinite matrices shows the potential of yielding important benefits to a large group of widely used applications. This paper examines how a multifrontal sparse solver performs when exploiting both the GPU and its multi-core host. It demonstrates that the GPU can dramatically accelerate the solver relative to one host CPU. ...

Solving dense symmetric indefinite systems using GPUs

This paper studies the performance of different algorithms for solving a dense symmetric indefinite linear system of equations on multicore CPUs with a Graphics Processing Unit (GPU). To ensure the numerical stability of the factorization, pivoting is required. Obtaining high performance of such algorithms on the GPU is difficult because all the existing pivoting strategies lead to frequent syn...

Solving Hermitian positive definite systems using indefinite incomplete factorizations

Incomplete LDL factorizations sometimes produce an indefinite preconditioner even when the input matrix is Hermitian positive definite. The two most popular iterative solvers for symmetric systems, CG and MINRES, cannot use such preconditioners; they require a positive definite preconditioner. One approach that has been extensively studied to address this problem is to force positive definitene...

Analysis of Block LDL^T Factorizations for Symmetric Indefinite Matrices

We consider the block LDL^T factorizations for symmetric indefinite matrices in the form LBL^T, where L is unit lower triangular and B is block diagonal with each diagonal block having dimension 1 or 2. The stability of this factorization and its application to solving linear systems has been well-studied in the literature. In this paper we give a condition under which the LBL^T factorization will r...

A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations

As modern hardware keeps evolving, an increasingly effective approach to develop energy efficient and high-performance solvers is to design them to work on many small size and independent problems. Many applications already need this functionality, especially for GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs. We describe the development...

Journal title:
  • Concurrency and Computation: Practice and Experience

Volume 29

Publication date: 2017